Preface


Linear predictive analysis and synthesis of speech is
used as a basis for implementing low bit rate voice transmission
and minimal digital storage of speech information.

The  synthesized  speech  resulting  from  the  linear  predictive 
analysis/synthesis of speech degraded by background noise
is  of  poor  quality.  This  report  describes  some  of  the 
methods  proposed  to  improve  the  quality  of  this  speech  and 
describes  the  implementation  and  performance  of  one  of  these 
methods. 

Thanks are due to my thesis advisor, Captain Larry Kizer,
for  overall  guidance  through  this  research.  Also,  Captain 
Kizer  did  a  tremendous  job  ensuring  that  our  Data  General 
computers  were  maintained  and  operational.  Professor 
Matthew  Kabrisky  was  most  helpful  in  providing  insight  into 
subjective  effects  of  digital  signal  processing  of  speech. 
Lieutenant  Robin  Simmons  is  thanked  for  his  assistance  in 
the  areas  of  computer  system  programming  and  understanding. 
Major  Ken  Castor  was  helpful  in  reviewing  this  report. 

Lastly,  I  wish  to  thank  my  typist  (decoder),  Ms.  Sharon 
Gabriel,  for  doing  a  good  job  on  this  report. 

Christopher  L.  Batchelor 
Abstract

Methods for improving the quality of the speech resulting
from linear predictive analysis/synthesis of speech degraded
by background noise are discussed. A method of noise cancellation
using Wiener filtering in the frequency domain with the
short-time Fourier transform was chosen for implementation.
Implementation was done on a Data General Nova/Eclipse digital
signal processing system in FORTRAN 5. Speech degraded by
white Gaussian noise was processed through linear predictive
analysis/synthesis with and without noise cancellation
preprocessing. Preliminary laboratory listening verified that
an improvement in quality was achieved with noise cancellation
preprocessing. Although improvement in quality was achieved,
more effort is required to make this implementation more
efficient and improve the quality of speech produced.
I.  INTRODUCTION


Linear predictive coding (LPC) is a successful analysis/
synthesis system for bandwidth compression of speech. Research
indicates that LPC based systems degrade quickly when processing
speech degraded by background noise (Ref 12:6). Thus, it
is of interest to apply a digital noise cancellation technique
to noise corrupted speech before LPC analysis/synthesis and
then evaluate that technique's effectiveness in reducing
degradation of quality.


Background 

Linear  prediction  analysis  is  based  on  the  idea  that  a 
speech  sample  can  be  approximated  as  a  linear  combination  of 
past  speech  samples.  Speech  can  be  modeled  as  the  output  of 
a  linear,  time-varying  system  excited  by  either  quasi-periodic 
pulses (voiced speech) or white noise (unvoiced speech)
(Ref 13:38-106). Figure 1 is a block diagram of this model.

The time-varying digital filter shown has the steady-state
system function of the form:

H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (1)

Figure 1. LPC Vocal Tract Model


This system function is an all-pole model and the poles define
the resonances or formants of the model as determined by the
coefficients (a_k). The number p is the order of the model.

The application of linear predictive analysis to encoding
speech for low bit rate transmission or storage is termed LPC
(linear predictive coding). The LPC analysis parameters are
the coefficients (a_k), the gain parameter G, the pitch period,
and a voiced-unvoiced parameter. Figure 2 shows a block diagram
of an LPC vocoder. The transmitter codes the LPC analysis
parameters for transmission through the channel and the receiver
decodes the parameters and synthesizes the output speech. The
LPC analysis parameters can be estimated by many different
methods, as described in Digital Processing of Speech Signals
by L.R. Rabiner and R.W. Schafer (Ref 13). These methods of
estimation for the coefficients (a_k) (which determine formants)
are not noise tolerant.
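
To make the estimation step concrete, the following is a minimal Python sketch of one common approach, the autocorrelation method solved with the Levinson-Durbin recursion. It is an illustration only and is not the procedure used in this report. The function names are hypothetical, and the sketch assumes the sign convention in which each sample is predicted as the sum of a_k times the kth past sample, and that the frame is already windowed and is not all zeros.

```python
import math

def autocorrelation(x, p):
    """Return autocorrelation lags r[0..p] of an already windowed frame x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the order-p normal equations.

    Returns (a, err): a[k-1] multiplies the sample k steps in the past,
    err is the remaining prediction error energy.
    """
    a = [0.0] * (p + 1)            # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err              # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def lpc(frame, p):
    """Return (coefficients a_1..a_p, gain G) for one frame."""
    r = autocorrelation(frame, p)
    a, err = levinson_durbin(r, p)
    return a, math.sqrt(max(err, 0.0))
```

Applied to the impulse response of a known second-order all-pole filter, this sketch recovers the filter coefficients and a gain of one, which is a convenient sanity check before trying it on speech frames.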

Experimental  research  has  demonstrated  that  four  major 
differences  exist  between  the  all-pole  linearly  predicted 
spectra  of  clean  and  noisy  speech  (Ref  13:29-30).  First, 
there  is  a  loss  of  low  energy  formant  information.  Secondly, 
the  formant  frequencies  are  shifted.  Thirdly,  the  bandwidth 
of  each  formant  is  wider.  Lastly,  an  overall  decrease  of 
spectral dynamic range exists. If the signal to noise ratio
is not too low, it has been observed that the primary perceptual
effect is generation of "musical tone" like sounds in
the background which causes degradation of speech quality
(Ref 8:601). Apparently, estimation of the coefficients
(a_k) is not accurate due to the addition of noise.

Figure 2. LPC Vocoder Model

Problem 

The  objective  of  this  thesis  is  to  examine  methods  to 
improve  the  estimation  of  the  LPC  coefficients  of  noisy  speech 
and implement one of these methods on the Data General
Nova-Eclipse speech processing system. After implementation,
performance of the method was to be determined subjectively.

Scope 

The  methods  to  improve  estimation  of  LPC  coefficients  to 
be  examined  will  include  noise  cancellation  pre-processing 
and  noise  tolerant  estimation  procedures.  This  thesis  will 
not discuss estimation of the LPC coefficients using pole-zero
modeling because preliminary results indicate that the
increased complexity of pole-zero modeling does not improve
performance of LPC (Ref 8).

Noise  cancellation  preprocessing  was  chosen  over  noise 
tolerant  estimation  procedures  for  implementation.  A  noise 
cancellation  preprocessor  can  be  used  for  other  purposes  to 
aid  in  robust  speech  processing. 

Approach 

Each  method  to  improve  estimation  of  coefficients  will 
be  described.  Also,  advantages  and  disadvantages  of  each 
method  shall  be  discussed.  Next,  a  detailed  description  of 
the implementation of the noise cancellation technique using
short-time  Fourier  analysis  will  be  given  with  accompanying 
results.  Finally,  recommendations  for  further  research  in 
this  area  shall  be  covered. 
II.  NOISE  CANCELLATION  METHODS


This section describes techniques which can be applied
to improve LPC analysis/synthesis of speech corrupted with
noise. The techniques described include time domain Wiener
filtering, frequency domain linear prediction filtering,
channel vocoder filter bank analysis filtering, linear
prediction coefficient estimation using the Maximum A Posteriori
(MAP) method, phase corrected spectral subtraction, and
frequency domain Wiener filtering using the short-time Fourier
transform. Unless otherwise specified, the following
descriptions are summaries of the reference given.

Time  Domain  Wiener  Filtering 

This  technique  is  the  application  of  a  Wiener  linear 
prediction  filter  to  reduce  additive  noise  provided  that 
the signal bandwidth is significantly less than the bandwidth
of the additive noise (Ref 1). The filter would be
applied  as  a  preprocessor  before  LPC  analysis/synthesis. 

The  implementation  is  in  the  time  domain  using  the 
Widrow-Hoff  LMS  algorithm.  Figure  3  is  a  schematic  diagram 
of a finite impulse response Wiener filter. The coefficients
W*(k), (k = 0, 1, ..., L-1) are chosen to minimize the
power in the error signal e(j) and satisfy Eq (2). \phi_{xx}(k)
is defined as the autocorrelation of the input signal x.
\phi_{xy}(\ell) is the cross correlation between the input x and
the output y.

\sum_{k=0}^{L-1} \phi_{xx}(\ell - k) W^*(k) = \phi_{xy}(\ell + \Delta)    (2)

Figure 3. Finite Impulse

Noise suppression results from the fact that the decorrelation
time for broadband noise is smaller than that for the
narrowband signal. Therefore, it is possible to choose a
value for the prediction distance Δ which will prevent the
noise components from appearing in the output.
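
The idea can be sketched in a few lines of Python. This is a minimal illustration of a Widrow-Hoff LMS predictor with a prediction distance, not the implementation evaluated in Ref 1. The function name, the step size `mu`, and the update form w <- w + 2*mu*e*x are assumptions made for this sketch.

```python
import math

def lms_predictor(x, num_taps, delta, mu):
    """Predict x[j] from samples at least `delta` steps in the past.

    Returns (y, e, w): predicted signal, error signal, final weights.
    The prediction y is the enhanced output; broadband noise that has
    decorrelated within `delta` samples cannot be predicted and so
    stays in e.
    """
    w = [0.0] * num_taps
    y, e = [], []
    for j in range(len(x)):
        # Delayed tap vector; samples before the start are taken as zero.
        taps = [x[j - delta - k] if j - delta - k >= 0 else 0.0
                for k in range(num_taps)]
        y_j = sum(wk * tk for wk, tk in zip(w, taps))
        e_j = x[j] - y_j
        # Widrow-Hoff update: step along the instantaneous gradient.
        w = [wk + 2.0 * mu * e_j * tk for wk, tk in zip(w, taps)]
        y.append(y_j)
        e.append(e_j)
    return y, e, w
```

For a noise-free sinusoid the error decays toward zero, since a narrowband signal remains predictable across the delay. The step size must be small relative to the input power for the recursion to remain stable.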

The  advantages  of  this  technique  are  that  the  prediction 
distance Δ can be chosen to provide optimum results for the
particular  noise  environment  and  no  external  reference  noise 
signal  need  be  provided. 

This technique has disadvantages also. First, no subjective
experimental results are provided for the performance
of the filter in conjunction with LPC analysis/synthesis.
Secondly, research has indicated that echo problems exist with
time domain implementations of Wiener filters (Ref 2:694).
Lastly, the signal bandwidth must be significantly less
than the bandwidth of the additive noise.

Frequency  Domain  Linear  Prediction  Filter 

This technique attempts to modify the LPC analysis/synthesis
process to account for corruption of the speech
with noise. Specifically, the speech extraction problem
is regarded as a parameter estimation problem (Ref 6).

The method assumes that the power spectral density of
the noise is known (noise is stationary) and that the statistics
of speech and noise are both Gaussian. A periodogram of
windowed speech corrupted with noise is calculated by Eq (3).

|X_T(f)|^2 = |S_T(f)|^2 + |D_T(f)|^2 + 2 \mathrm{Re}[S_T(f) D_T^*(f)]    (3)

X_T(f), S_T(f) and D_T(f) are the Fourier transforms of windowed
noise corrupted speech, speech, and noise, respectively.
The unbiased estimate of the speech spectrum is given by
Eq (4).

|\hat{S}_T(f)|^2 = |X_T(f)|^2 - E[ |D_T(f)|^2 + 2 \mathrm{Re}\{S_T(f) D_T^*(f)\} ]    (4)

This  estimate  of  the  speech  spectrum  is  smoothed  with  a  spectral 
window  producing  the  spectral  envelope.  Then  inverse  Fourier 
transforming  the  spectral  envelope  gives  the  autocorrelation 
coefficients  used  in  linear  predictive  analysis  (Ref  13:401-403). 
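
The chain from periodogram to autocorrelation coefficients can be sketched as follows. This is a simplified Python illustration under stated assumptions: the expected noise term is supplied as a known per-bin power, negative results are floored at zero, and the spectral smoothing step is omitted. The function names are hypothetical, and a direct DFT is used for clarity rather than speed.

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[i] * cmath.exp(sign * 2j * cmath.pi * k * i / n)
                  for i in range(n))
        out.append(acc / n if inverse else acc)
    return out

def autocorr_from_spectrum(frame, noise_power, num_lags):
    """Estimate autocorrelation lags of the speech in a noisy frame.

    frame       -- windowed noisy samples
    noise_power -- assumed known noise power per DFT bin (2*len(frame) bins)
    """
    # Zero pad to twice the length so the lags are linear, not circular.
    padded = list(frame) + [0.0] * len(frame)
    spectrum = dft(padded)
    # Subtract the expected noise power from the periodogram.
    cleaned = [max(abs(xk) ** 2 - nk, 0.0)
               for xk, nk in zip(spectrum, noise_power)]
    lags = dft(cleaned, inverse=True)
    return [lags[k].real for k in range(num_lags)]
```

With a flat (white) noise power, only the zero-lag value changes, which is the frequency domain counterpart of reducing the diagonal of the correlation matrix.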

The  advantage  of  this  method  is  that  noise  reduction  can 
be  tailored  for  the  particular  noise  environment  encountered 
by  the  system. 
The disadvantage of this method is that it is not simple
to modify existing LPC analysis/synthesis systems to estimate
the spectral envelope in this manner.

Vocoder  Filter  Bank  Analysis 

This technique performs a spectral decomposition of noisy
speech via channel vocoder analysis and attenuates each spectral
component depending on how much the measured speech plus
noise power exceeds an estimate of the background noise power
(Ref 11). This filter would be used as a preprocessor before
LPC analysis/synthesis.

A  two  state  model  for  the  speech  event  is  applied  in 
determining  the  maximum  likelihood  estimator  of  the  speech 
power.  This  model  resulted  in  a  class  of  suppression  curves 
which  permits  a  tradeoff  of  noise  suppression  against  speech 
distortion.  Real-time  experiments  have  shown  that  the  noise 
can  be  made  imperceptible  by  proper  choice  of  a  suppression 
factor,  but  distortion  increases  as  the  input's  signal  to 
noise  ratio  decreases. 

The  noise  suppression  filter  consists  of  a  bank  of  second 
order  Butterworth  bandpass  filters  which  span  the  frequency 
range  120-3270  Hertz.  Figure  4  is  a  block  diagram  of  this 
system. Measurements must be made to determine the instantaneous
signal power and the average signal power at the
output of each of the channel filters in order to compute the
channel gains. Experimentation showed that a four second
histogram of the frame energies of the signal was bimodal
(Ref 11:700). A threshold could be set between the modes,
and frames which were speech absent could be determined with
high probability. Equation (5) defines the nth channel
measurement parameter (g_n(m)).


g_n(m) = [expression illegible in the source scan]    (5)


A soft decision point of view determines a class of suppression
curves defined by Eq (6).


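G_n(m) = \frac{1}{2}\left(1 + \sqrt{\frac{\alpha}{1+\alpha}}\right) \frac{\exp(-\alpha) I_0(2\sqrt{\alpha}\, g_n(m))}{1 + \exp(-\alpha) I_0(2\sqrt{\alpha}\, g_n(m))}    (6)

\alpha = suppression factor    (7)

I_0(x) = \frac{1}{2\pi} \int_0^{2\pi} \exp(x \cos\theta)\, d\theta   (modified Bessel function)    (8)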


In a real-time implementation, the measurement parameter
g_n(m) is used as a pointer for a table look-up to determine
the attenuation G_n(m). To avoid discontinuities, a smoothed
gain is calculated from G_n(m) and applied to the appropriate
channel. The channel waveforms are then added together
to produce the prefiltered waveform.
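
A look-up table of this kind needs values of the modified Bessel function. As a hedged illustration, the Python sketch below evaluates I_0(x) directly from its integral definition over one period of theta and tabulates it on a uniform grid. The function names, the number of integration steps, and the grid are assumptions for this sketch and are not taken from Ref 11.

```python
import math

def bessel_i0(x, steps=256):
    """Evaluate I_0(x) = (1/2pi) * integral over [0, 2pi) of exp(x cos t) dt.

    The integrand is periodic, so an equally spaced sum converges quickly.
    """
    h = 2.0 * math.pi / steps
    total = sum(math.exp(x * math.cos(i * h)) for i in range(steps))
    return total * h / (2.0 * math.pi)

def build_table(x_max, entries):
    """Tabulate I_0 on a uniform grid from 0 to x_max (entries >= 2)."""
    step = x_max / (entries - 1)
    return [bessel_i0(i * step) for i in range(entries)]
```

In a real-time system the table would be computed once, and a measured parameter would then be quantized to the nearest grid point to index it.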


This prefilter has three primary advantages. First, it
is possible to select a suppression factor which would optimize
intelligibility for a given signal to noise ratio. Secondly,
it is possible to integrate the prefilter with efficient
channel vocoder implementations. Lastly, no reference
noise signal need be provided.

The  disadvantage  of  this  method  is  that  it  is  relatively 
more  complex  and  would  be  more  difficult  to  implement. 

Linear  Prediction  Coefficient  MAP  Estimation 

The  Maximum  A  Posteriori  (MAP)  estimation  procedure  can 
be  used  to  estimate  the  LPC  coefficients  from  speech  waveforms 
degraded by additive white Gaussian noise. But this procedure
requires solving a set of non-linear equations, which requires
too much computation time. However, the true MAP estimation
procedure  can  be  approximated  by  an  iterative  method  that 
requires  the  solution  of  sets  of  linear  equations  (Ref  8). 

Equation (9) describes noisy speech y(n) as the sum of
speech (s(n)) and white Gaussian noise (d(n)).

y(n) = s(n) + d(n)    (9)

The LPC coefficients can be written in vector form a. The
MAP estimate of a is the vector that maximizes p(a|y) (the
probability density of a conditioned on y). It can be shown
that maximizing p(a|y) is a non-linear problem (Ref 8). A
"suboptimal" MAP procedure is proposed which estimates s and
a by maximizing p(a, s|y). In the iterative procedure, an
initial estimate â_0 is obtained, then s is estimated by
E{s | â_0, y}. With the estimate ŝ, a new estimate â_1 is obtained
from linear predictive analysis. With the new â_1, the
above procedure is repeated obtaining â_2, etc. Estimating
s by E{s | â_i, y} is a linear problem and this iterative procedure
converges to a solution that is at least a local
maximum of p(a, s|y) (Ref 13:201-203). Also, each estimate
of s by E{s(n) | â, y} can be approximated by filtering y(n)
by a non-causal Wiener filter (Eq (10)).

H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_d(\omega)}    (10)

P_d(\omega) is the power spectral density of the noise.

The primary disadvantage of this technique is that the
development is done for the white Gaussian noise case. A
viable noise tolerant estimation of LPC coefficients must
take into account different noise environments.

Phase Corrected Spectral Subtraction

Spectral subtraction noise cancellation preprocessing
has been implemented to improve the quality of LPC speech.
This method helps, but it introduces "musical noise" problems.
A new approach to spectral subtraction noise cancellation
has been proposed which avoids the artificial phase
distortion which is said to produce the "musical noise"
(Ref 10).
Figure 5 is a block diagram of the old approach to
spectral subtraction noise cancellation. The time domain
output signal is constructed using the phase of the signal
plus noise input. Because of this artificial phase distortion,
the output does not conform to a linear predictive
model. According to the new approach, this problem is
eliminated when the subtraction is done at the point where
the autocorrelation is calculated or when the covariance
matrix is computed in linear prediction analysis.

Figure 6 is a diagram of the new approach in which the
autocorrelation lags are modified. If the input noise is
assumed white, then the pre-whitening filter can be discarded.

The covariance matrix can be modified also to effect
noise cancellation. For white noise, only the diagonal
terms of the covariance matrix need be reduced by the noise
power for the matrix to conform to the noise-free case.
The non-white noise case would require a time varying
pre-whitening filter.
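
For the white noise case the correction is small enough to show directly. The Python sketch below is an illustration under stated assumptions: the noise power is known, the correlation values are expressed as autocorrelation lags (so the diagonal term is the zero lag), and a small floor keeps the zero lag positive. The function names and the floor value are hypothetical.

```python
def remove_white_noise(r, noise_power, floor=1e-6):
    """Subtract a known white noise power from the zero-lag term only.

    White noise is uncorrelated from sample to sample, so it adds to
    r[0] and leaves the other lags unchanged on average.
    """
    corrected = list(r)
    corrected[0] = max(r[0] - noise_power, floor * r[0])
    return corrected

def first_order_coefficient(r):
    """First-order predictor coefficient a_1 = r[1] / r[0]."""
    return r[1] / r[0]
```

As a worked case, a first-order process with coefficient 0.5 and unit excitation has lags r[0] = 4/3 and r[1] = 2/3. Adding unit-power white noise raises r[0] to 7/3 and biases the estimate to 2/7. Subtracting the noise power restores 0.5.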

The  primary  advantage  of  this  method  is  that  no  noise 
reference  channel  need  be  provided. 

Noise  Cancellation  Using  the  Short  Time  Transform 

The time domain implementation of a Wiener least squares
filter to perform noise cancellation is ineffective when the
noise characteristics (e.g., mean and variance) change
rapidly with time. Thus, a frequency domain approach using
the short-time Fourier transform, computed with the FFT, to
estimate the required Wiener filter is proposed (Ref 2). Using
the efficiency of the FFT results in a computation rate which is
proportional to the filter length times the log of the
filter length. Therefore, the FFT approach is a viable
alternative for real time implementation.

Figure 7 is a block diagram of the filter's construction.
The signal x represents speech plus noise and v represents
a noise reference channel. S_vx(f) is the cross power spectral
density between speech plus noise and noise. S_vv(f) is the
power  spectral  density  of  the  noise.  W(f)  is  the  estimated 
Wiener  filter.  A  more  complete  description  of  this  filter 
will  be  given  in  the  next  section,  because  this  filter  was 
chosen  for  implementation. 

There are three primary advantages of this approach.
First, experimental comparisons have shown that the computational
efficiency is 3.5 times as great as time domain
methods. This means that, for reverberant high noise
environments requiring large filter lengths, implementation
in real time may be accomplished. Secondly, although the
filter structure shown in Figure 7 requires a reference
noise channel, it is possible to obtain estimates of the
background noise spectrum during speaker-silent segments.
Lastly, no a priori information about the energy in the
reference channel is needed in order to pick a prediction
distance as in some time domain methods. If the prediction
distance is marginally too large, then the algorithm will
produce echo. If too small, the algorithm will be slow to
converge. This gain adjustment is taken care of automatically
in the frequency domain approach. Therefore,
higher quality output, free from echo, is produced, as has
already been demonstrated.

Figure 7. Block Diagram of Short Time Transform Noise Cancellor

There  are  two  disadvantages  of  this  method.  It  will 
require  considerably  more  memory  than  time  domain  methods. 
Also,  it  will  be  much  more  complex  to  program  than  time 
domain  methods. 




III.  Implementation of Noise Cancellation Technique

I chose to implement the short-time transform technique
of noise cancellation. This technique was described in
the previous section and is depicted in Figure 8. This technique
was chosen for the following reasons. First, it is
a preprocessor and thus can be used directly to improve LPC
processing without modification of the LPC implementation.
Also, as a preprocessor, it may be used to enhance the performance
of other speech processing procedures being developed
(one example is phoneme recognition). Secondly, this
frequency domain technique had faster adaptation time than
time domain implementations of the Wiener filter (Ref 2).
Thirdly, preliminary subjective results indicated that
frequency domain methods do not have echo problems, as discussed
in the last section. Lastly, it could be implemented
in the time allowed with greater ease than the frequency
domain vocoder method discussed earlier. The frequency
domain vocoder shares with the short-time transform method
the advantages of faster adaptation time and lack of echo
problems, but is much more complex and more difficult to
implement. Therefore, the short-time transform method was
chosen over methods which modified the LPC implementation,
were implemented in the time domain, or required too much
complexity for implementation in the time allotted.
General System Information

This technique was implemented in FORTRAN 5 on a Data
General Nova-Eclipse signal processing system. Analog
speech data are low-pass filtered at 4.0 kilohertz and
sampled at 8000 samples/second (approximately the Nyquist
rate). Speech files consist of 88 blocks (256 words/block)
of integers ranging from -2048 to 2048 (-5 to +5 Volts).
More specific information about input/output of speech files
through program AUDIOHIST, written by Paul Finkes, is available
(Ref 14).

Short-Time  Transform  Noise  Cancellation 

The  implementation  of  this  frequency  domain  method  is 
based on an article by Lawrence R. Rabiner and Jont B. Allen,
titled "Short-Time Fourier Analysis Techniques for FIR System
Identification and Power Spectrum Estimation" (Ref 14).
Specifically, Rabiner and Allen define the short-time Fourier
transform of a signal x(n) at time mR as Eq (12):

X(m,k) = \sum_{n=mR-L+1}^{mR} x(n) w(mR-n) \exp(-j \frac{2\pi}{N} kn)    (12)

R  is  the  period  between  estimates  of  the  short-time  transform 
of  the  signal  and  w(n)  is  a  causal  FIR  window  of  duration  L 
samples. They show that if the speech plus noise input is
defined as x(n) and the noise input as v(n), then the unbiased
Wiener filter estimate is given by Eq (13).
H(K) = \frac{S_{vx}(K)}{S_{vv}(K)}    (13)

where

S_{vx}(K) = \sum_{m=0}^{p-1} \sum_{q=q_1}^{q_2} X(m,K) V^*(m+q,K)    (14)

S_{vv}(K) = \sum_{m=0}^{p-1} \sum_{q=q_1}^{q_2} V(m,K) V^*(m+q,K)    (15)

q_1 = - integer part of [(L+M-2)/R]

q_2 = integer part of [(L-1)/R]

M = estimate of the system's impulse response duration

p = number of analysis sections.

Bounds are defined for the parameters L and R in Eqs (16)
and (17).

L \geq M    (16)

For a Hamming window,

R \leq L/4    (17)
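
The short-time transform of Eq (12) can be written out directly. The Python sketch below is a plain evaluation of that sum for one analysis time, intended only to make the indexing explicit. It is not the FORTRAN implementation described in this report, and it uses a direct sum rather than an FFT. The function names and the choice to treat samples outside the signal as zero are assumptions of this sketch.

```python
import cmath
import math

def hamming(length):
    """Hamming window of the given length (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def short_time_transform(x, m, hop, window, n_fft):
    """Evaluate X(m, k) for k = 0..n_fft-1 at analysis time m*hop.

    x      -- input samples
    hop    -- period R between successive transforms
    window -- causal window w(0..L-1); w(mR - n) weights sample n
    """
    length = len(window)
    end = m * hop
    bins = []
    for k in range(n_fft):
        acc = 0j
        for n in range(end - length + 1, end + 1):
            if 0 <= n < len(x):   # samples outside the signal count as zero
                acc += (x[n] * window[end - n]
                        * cmath.exp(-2j * cmath.pi * k * n / n_fft))
        bins.append(acc)
    return bins
```

Because the exponential uses the absolute sample index n, transforms taken at different analysis times share a common phase reference. An FFT applied to each shifted frame measures phase from the start of that frame instead, which is why a phase correction is needed when FFTs of successive frames are combined.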


For this implementation, M was chosen to be equal to L,
or 128 samples (16 msec). R is chosen to be 32 samples
(4 msec). The parameters q_1 and q_2 are then calculated as
-7 and 3. The number of analysis sections p is left as a
variable whose effect is to be evaluated because no information
relating to this specific application was available in order
to choose a value. Rabiner and Allen do show that, as p
increases, less error is made in the estimate H(K) (Ref 14:
190-191). The parameter p is related to the total number
of points used in the analysis N' by Eq (18).

p = integer part of [expression illegible in the source scan]    (18)
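
The lag limits quoted above follow from integer arithmetic on L, M and R. The short Python sketch below reproduces them and counts the transforms that must be held in storage for a given p. The function names are hypothetical.

```python
def lag_limits(window_len, response_len, hop):
    """Return (q1, q2) for window length L, response length M and hop R."""
    q1 = -((window_len + response_len - 2) // hop)
    q2 = (window_len - 1) // hop
    return q1, q2

def stored_transforms(p, q1, q2):
    """Number of short-time transforms needed per estimate: p + q2 - q1."""
    return p + q2 - q1
```

With L = M = 128 and R = 32 this gives q_1 = -7 and q_2 = 3, so an estimate with six analysis sections draws on 16 stored transforms per channel and one with three sections draws on 13.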

The process of noise cancellation can be
described as follows. First, weight 128 point sequences of
both channels with Hamming windows and compute an N point FFT by
zero filling from 129 to N. Repeat the previous step (shifting
by 32 samples each time) until p + q_2 - q_1 FFTs have been
calculated and stored. Then, calculate S_vx(K) and S_vv(K)
according to Eqs (14) and (15), correcting the relative phase
between successive FFTs at each analysis frame m due to the
shift of 32 samples between successive FFTs. H(K) is then
calculated according to Eq (13). An FFT of the noise reference
channel V(0,K) is multiplied by H(K) and then inverse Fourier
transformed to produce the sampled sequence of the correction
signal on the interval 1 to 128. The stored FFTs of the
two channels are updated by shifting in memory and storing
one new FFT of both channels in the space opened up after
shifting. Then H(K) is calculated again as above, repeating
the process. A new sampled sequence of the correction signal
is produced and added with a 75% overlap of the previous
sequence calculated. After reconstruction, the correction
sequence is subtracted from the speech plus noise channel,
completing the operation. Appendix A gives a description of
how this process was implemented.
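
The filter estimate at the center of this procedure can be sketched compactly. The Python fragment below forms the double sums over analysis sections and lags and takes their ratio for each frequency bin. It is an illustration under stated assumptions, not the program of Appendix A: the stored transforms are assumed to be already phase aligned, list position i is assumed to hold the transform for analysis index i + q_1, and bins where the noise sum is negligible are set to zero. All names are hypothetical.

```python
def wiener_estimate(X, V, p, q1, q2, eps=1e-12):
    """Estimate H(K) = S_vx(K) / S_vv(K) from stored transforms.

    X, V -- lists of p + q2 - q1 frames; each frame is a list of complex
            bins. List position i holds analysis index m = i + q1.
    """
    offset = -q1
    num_bins = len(V[0])
    H = []
    for k in range(num_bins):
        s_vx = 0j
        s_vv = 0j
        for m in range(p):
            for q in range(q1, q2 + 1):
                v_conj = V[m + q + offset][k].conjugate()
                s_vx += X[m + offset][k] * v_conj
                s_vv += V[m + offset][k] * v_conj
        H.append(s_vx / s_vv if abs(s_vv) > eps else 0j)
    return H
```

A simple check is the case where the primary channel is an exact scaled copy of the reference: the estimate then equals the scale factor in every bin, so multiplying the reference transform by H(K) and subtracting would cancel the primary channel completely.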
IV.  Results


The  short-time  noise  cancellation  implementation  was 
used  to  process  files  consisting  of  2.0  seconds  of  speech 
plus  noise.  Each  speech  plus  noise  file  was  generated  with 
a  predefined  speech  to  noise  ratio  by  program  SPLUSN.  The 
noise used as the reference channel and by SPLUSN is white
Gaussian and generated by program NOISE. Both NOISE and
SPLUSN  are  described  in  Appendix  A. 
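
As a rough illustration of mixing at a predefined speech to noise ratio, the Python sketch below scales a noise sequence so that the ratio of speech energy to noise energy matches a target in decibels. This is not program SPLUSN: it uses plain unweighted energies, whereas SPLUSN applies weighting curves as described in Appendix A. The function name is hypothetical, and both sequences are assumed to have the same length and nonzero energy.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Return (speech + scaled noise, scaled noise) at the given SNR in dB."""
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(v * v for v in noise)
    # Choose the gain so that speech_energy / scaled_noise_energy
    # equals 10 ** (snr_db / 10).
    gain = math.sqrt(speech_energy / (noise_energy * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * v for v in noise]
    mixed = [s + v for s, v in zip(speech, scaled)]
    return mixed, scaled
```

At 0 dB the scaled noise carries the same energy as the speech. The scaled noise is returned separately so that it can also serve as the reference channel.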

The  time  required  to  process  each  2.0  second  speech  file 
using  six  analysis  frames  per  estimation  was  excessive 
(approximately 5.2 hours). The execution time is approximately
proportional to the number of frames per estimation
since  processing  with  three  analysis  frames  per  estimation 
required  2.6  hours.  The  total  run-time  can  be  approximately 
broken  down  into  three  areas.  It  took  27.8%  of  the  time  to 
read  and  write  complex  arrays  to  memory.  It  took  4.22%  of 
the  time  to  calculate  1024  point  DFTs.  Also,  it  took  72.16% 
of  the  time  to  perform  all  other  processing  which  mostly 
included  complex  arithmetic  and  logical  operations. 

The noise cancellor performance is depicted by Figures
9, 10 and 11. Figure 9 is a plot of four sequential segments
(each segment is 0.128 seconds long) of speech. Figure 10 is
a plot of the same speech segments plus noise (SNR = 0 dB, as
defined by program SPLUSN in Appendix A using B weighting
curves on speech and noise energies). Figure 11 is a plot
of the output of the noise cancellor when the speech plus noise
in Figure 10 is used as its input and six analysis frames
were used per estimation. Note that little noise appears
in the output, but the speech waveform is distorted. It is
more difficult to visually pick out the glottal pulse in the
speech waveform. I listened to the output speech and it sounded
like whispered speech. Also, little noise was heard in the
output.

Figure 9. Speech Waveform

Figure 10. Speech Plus Noise Waveform

Noise  cancellation  was  performed  with  three  and  six 
analysis  frames  per  estimation. 

It  was  noted  that  the  output  speech  was  less  whisper¬ 
like  (higher  quality)  with  six  analysis  frames  per  estimation. 
However,  the  output  speech  was  still  whisper-like  and  of  low 
quality. 

Next,  I  used  our  Interactive  Laboratory  System  (ILS)  to 
perform  linear  predictive  analysis/synthesis  on  speech  plus 
noise  and  noise  cancelled  speech.  The  use  of  the  ILS  system 
in  this  application  is  described  in  Appendix  B.  Also,  the 
details  on  how  the  ILS  system  performs  linear  predictive 
analysis/synthesis  are  given  in  the  ILS  application  note 
number 1 entitled "Speech Analysis and Synthesis" (Ref 15).

The  number  of  points  per  analysis  frame  used  in  the  LPC 
analysis  was  chosen  to  be  128.  The  number  of  coefficients 
estimated  was  chosen  to  be  10.  Also,  analysis  frames  were 
windowed  with  a  standard  Hamming  window. 

The synthesized speech is depicted in Figures 12 and 13.
Figure 12 is the LPC processed version of the speech plus
noise shown in Figure 10. Figure 13 is the LPC processed
version of the noise cancelled speech shown in Figure 11.

Figure 12. LPC Synthesized Speech Plus Noise

Figure 13. LPC Synthesized Noise Cancelled Speech

After  listening,  the  LPC  processed  speech  plus  noise  was 
perceived  to  be  of  very  poor  quality.  The  LPC  processed 
noise  cancelled  speech  retained  some  of  its  whisper-like 
quality,  but  was  of  much  higher  quality  than  the  LPC  processed 
speech  plus  noise. 




V.  Conclusions 




This  short-time  noise  cancellation  method  was  capable 
of  improving  the  quality  of  LPC  processed  speech  plus  noise, 
but  this  implementation  has  two  major  weaknesses.  First, 
this  implementation  takes  an  excessive  amount  of  time  to 
perform  noise  cancellation.  Secondly,  the  output  speech 
produced  by  this  implementation  sounded  like  whispered  speech. 

This implementation takes an excessive amount of time
for three reasons. First, 27.8% of the processing time is
devoted to reading and writing 1024 point complex arrays
(4096 words) to disk. Secondly, the Data General
Nova/Eclipse performs floating point instead of integer
complex arithmetic. Integer arithmetic is carried out more
quickly, but floating point arithmetic allows greater
dynamic range in processing. Lastly, the DFT size was
chosen to be 1024 and may be reduced to 512 without reducing
the performance of the cancellor. A DFT size of 512 would
cut in half the time necessary to read and write complex
arrays to disk and the time required for complex arithmetic
operations.
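
The disk-transfer figures can be checked with simple arithmetic. The sketch below assumes each complex point is stored as two single-precision reals of two 16-bit words each, which is consistent with the 4096-word figure quoted for a 1024-point array. The constant and function names are hypothetical.

```python
WORDS_PER_REAL = 2      # assumption: one 32-bit real occupies two 16-bit words
REALS_PER_COMPLEX = 2   # real part and imaginary part


def words_per_complex_array(n_points):
    """Number of 16-bit words needed to store n_points complex values."""
    return n_points * REALS_PER_COMPLEX * WORDS_PER_REAL


words_1024 = words_per_complex_array(1024)  # current DFT size
words_512 = words_per_complex_array(512)    # proposed DFT size
```

Under this assumption the 512-point array needs half as many words as the 1024-point array, so each disk read or write moves half the data.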

The whispered speech effect of the output from the
cancellor was reduced when the number of analysis frames
used per estimation was doubled from three to six.
Increasing the number of analysis frames used in each
estimate of the cross-spectrum and the auto-spectrum
smooths these spectra more over time.
Thus, the Wiener filter estimate does not change as rapidly
over time and will not produce as much modulation of the
speech waveform. Additional smoothing could be obtained
by forming partial sums of past and present spectrum
estimates.
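
The smoothing effect described above can be illustrated with a minimal Python sketch. It assumes the Wiener filter at each frequency bin is the ratio of the cross-spectrum (noise reference against speech plus noise) to the auto-spectrum of the noise reference, with both spectra averaged over the last k analysis frames. This form of the filter and all names are assumptions made for illustration.

```python
def averaged_spectra(ref_frames, pri_frames, k):
    """Average cross- and auto-spectra over the most recent k frames.

    ref_frames: DFTs of the noise reference, one list of complex bins per frame.
    pri_frames: DFTs of speech plus noise, same layout.
    """
    recent_ref = ref_frames[-k:]
    recent_pri = pri_frames[-k:]
    count = len(recent_ref)
    n_bins = len(recent_ref[0])
    cross = [0j] * n_bins
    auto = [0.0] * n_bins
    for ref, pri in zip(recent_ref, recent_pri):
        for b in range(n_bins):
            cross[b] += ref[b].conjugate() * pri[b]   # cross-spectrum term
            auto[b] += abs(ref[b]) ** 2               # auto-spectrum term
    return [c / count for c in cross], [a / count for a in auto]


def wiener_filter(cross, auto, floor=1e-12):
    """Per-bin filter H = cross / auto, guarded against division by zero."""
    return [c / max(a, floor) for c, a in zip(cross, auto)]


# Hypothetical check: the primary channel is the reference scaled by a
# constant complex gain, so the estimated filter should equal that gain.
gain = 0.5 - 0.25j
ref_frames = [[complex(1 + f, b + 1) for b in range(4)] for f in range(6)]
pri_frames = [[gain * x for x in frame] for frame in ref_frames]
cross, auto = averaged_spectra(ref_frames, pri_frames, k=6)
h = wiener_filter(cross, auto)
```

Raising k from 3 to 6 means each filter estimate shares more frames with its neighbours in time, which is the mechanism the text credits for the reduced modulation.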
VI. Recommendations

Areas for future research into the performance of this
implementation of short-time transform noise cancellation
include the following:

1. Whether or not the current implementation is
modified, formal subjective testing (diagnostic
rhyme test) should be carried out with LPC processed
speech plus noise and LPC processed noise cancelled
speech. This testing would provide a firmer basis
for describing the performance of the noise cancellor.

2. The addition of an array processor on the Data
General Eclipse computer could greatly increase the
speed of complex array processing. This noise
cancellor would have to be modified to allow the
array processor to perform arithmetic operations
on complex arrays.

3.  The  addition  of  more  memory  allocated  for  programs 
may  make  it  feasible  to  store  FFTs  in  extended 
memory  instead  of  on  disk.  Thus,  the  program  would 
not  have  to  perform  input/output  operations  to  disk 
which  take  27.8%  of  the  current  processing  time. 

4. This noise cancellor's FFT size of 1024 points
could be halved to 512 points, cutting the processing
time in half. However, it must be determined whether
or not the decreased resolution in frequency
reduces the noise cancellor's ability to reject
single interfering tones.

5.  Spectral  smoothing  of  the  power  spectrum  estimates 
would  probably  improve  the  quality  of  the  noise 
cancelled  speech.  This  could  be  accomplished  by 
taking  partial  sums  of  past  and  present  power 
spectrum  estimates. 

6. The current implementation requires an external
noise reference channel. This implementation
could be modified to update the estimate of the
power spectrum of the noise during speaker-silent
segments of the speech plus noise channel. This
requires detection of the speaker-silent segments
by thresholding. The threshold is chosen by
studying four-second histograms of speech plus
noise, which are bimodal, and picking a value
between the two modes.
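
The thresholding step in recommendation 6 can be sketched as follows. This is a minimal illustration, assuming frame energies in decibels are histogrammed and the threshold is placed at the emptiest bin between the two tallest well-separated bins. The bin count, the peak-picking rule, and all names are assumptions, not a tested design.

```python
import math


def frame_energies_db(samples, frame_len):
    """Mean energy of each non-overlapping frame, in decibels."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        e = sum(s * s for s in samples[start:start + frame_len]) / frame_len
        out.append(10.0 * math.log10(e + 1e-12))
    return out


def histogram(values, n_bins):
    """Equal-width histogram; returns (counts, lowest value, bin width)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts, lo, width


def bimodal_threshold(values, n_bins=20):
    """Pick a threshold at the emptiest bin between the two modes."""
    counts, lo, width = histogram(values, n_bins)
    first = max(range(n_bins), key=lambda i: counts[i])
    # Second mode: tallest bin at least two bins away from the first.
    second = max((i for i in range(n_bins) if abs(i - first) >= 2),
                 key=lambda i: counts[i])
    left, right = sorted((first, second))
    valley = min(range(left + 1, right), key=lambda i: counts[i])
    return lo + (valley + 0.5) * width


# Hypothetical energies: a noise-only cluster near 10 dB and a
# speech-plus-noise cluster near 30 dB.
energies = [10.0 + 0.1 * i for i in range(10)] + [30.0 + 0.1 * i for i in range(10)]
threshold = bimodal_threshold(energies)
silent = [e < threshold for e in energies]
```

Frames flagged as silent would then be the ones used to update the noise power spectrum estimate.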
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Software 

Noise  Cancellation  Process 

The noise cancellation process consists of three
programs executed in sequence. First, BEGNCL obtains the
names of the I/O files and the number of analysis frames per
estimation from the user and stores them for the following
programs to read. Second, NCANCEL performs the
calculations necessary to produce the time domain estimate
of the noise (the correction signal). Third, SUBTRACT
subtracts the time domain estimate of the noise from the
speech plus noise file. The entire process is executed
by running macrofile NC.MC. Figure 14 is a flowchart of
the overall process. The following source listings
explain each program.
Figure 14. Flowchart of Implementation of Noise Cancellor.
Blocks shown: Input File Names and Frames/Estimation; Set Up
Arrays and I/O Channels; Initialize Hamming Window Array and
Clear Other Arrays; Initialize Extended Memory with Speech
Plus Noise and Noise; Calculate and Store FFTs of Windowed
Speech Plus Noise; Calculate and Store FFTs of Windowed Noise;
Calculate Power Spectrum Estimate of Speech Plus Noise;
Calculate Power Spectrum Estimate of Noise; Calculate Wiener
Filter, Filter Noise, and Inverse FFT Filtered Noise;
Reconstruct Filtered Noise by Overlap and Add Method; Output
Filtered Noise; Subtract Estimate of Noise from Speech and
Noise; Output Noise Cancelled Speech; Processing Complete.
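
The last two stages named in Figure 14 can be illustrated with a short Python sketch: overlap-and-add reconstruction of the filtered noise from successive frames, followed by sample-by-sample subtraction from the speech plus noise. The hop size, the frame layout, and the function names are assumptions made for illustration.

```python
def overlap_add(frames, hop):
    """Sum equal-length frames into one signal, advancing hop samples per frame."""
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for idx, frame in enumerate(frames):
        start = idx * hop
        for n, value in enumerate(frame):
            out[start + n] += value
    return out


def subtract(primary, noise_estimate):
    """Subtract the noise estimate from speech plus noise over their common length."""
    n = min(len(primary), len(noise_estimate))
    return [primary[i] - noise_estimate[i] for i in range(n)]


# Hypothetical data: two 4-sample filtered-noise frames with 50% overlap.
noise_estimate = overlap_add([[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]], hop=2)
cleaned = subtract([5.0, 5.0, 5.0, 5.0, 5.0, 5.0], noise_estimate)
```

In the overlapped region the two frames add, so the reconstructed estimate there is the sum of both contributions.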
DO 2 I=1,256 ; subtract with shift of 224 on speech plus noise


Noise  Generation 


The  following  source  listing  for  NOISE  explains  the 
noise  generation  process. 
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DOUBLE  PRECISION  A,  IX,M,IY 

INTEGER XOUT(15972) ; array containing noise for output
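
One common way to generate approximately Gaussian white noise is sketched below in Python. It assumes a multiplicative congruential uniform generator and sums twelve uniform samples to approximate a Gaussian. The multiplier 16807 and modulus 2**31 - 1 are an illustrative choice and are not taken from the NOISE listing; only the array length 15972 matches XOUT above.

```python
A = 16807           # multiplier (assumed, illustrative)
M = 2147483647      # modulus 2**31 - 1 (assumed, illustrative)
N_SAMPLES = 15972   # length of the output array XOUT in the listing


def uniform_stream(seed):
    """Multiplicative congruential generator yielding values in (0, 1)."""
    ix = seed
    while True:
        ix = (A * ix) % M
        yield ix / M


def gaussian_noise(n, sigma, seed=1):
    """Approximate Gaussian samples with zero mean and standard deviation sigma."""
    u = uniform_stream(seed)
    out = []
    for _ in range(n):
        # The sum of 12 uniform(0, 1) samples has mean 6 and variance 1.
        g = sum(next(u) for _ in range(12)) - 6.0
        out.append(sigma * g)
    return out


noise = gaussian_noise(N_SAMPLES, sigma=100.0)
```

The samples would still need rounding to integers before being written to an integer output array.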
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Adding  Noise  to  Speech 


The  following  source  listing  for  SPLUSN  explains  the 
process  of  adding  noise  to  speech  with  a  predefined  signal 
to  noise  ratio. 
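
Scaling noise to a predefined signal to noise ratio can be sketched in Python as follows. The definition of the ratio as average speech power over average noise power across the whole file, expressed in decibels, is an assumption, as are the function names.

```python
import math


def mean_power(x):
    """Average power (mean square value) of a sequence."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Return (speech + gain * noise, gain) for a target SNR in decibels.

    The gain satisfies 10*log10(P_speech / (gain**2 * P_noise)) = snr_db.
    """
    n = min(len(speech), len(noise))
    speech, noise = speech[:n], noise[:n]
    ratio = 10.0 ** (snr_db / 10.0)
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * ratio))
    mixed = [s + gain * v for s, v in zip(speech, noise)]
    return mixed, gain


# Hypothetical data: unit-power "speech" and noise with power 4.
speech = [1.0, -1.0] * 4
noise = [2.0, 2.0, -2.0, -2.0] * 2
mixed, gain = add_noise_at_snr(speech, noise, snr_db=0.0)
```

At 0 dB the scaled noise has the same average power as the speech, so the gain here comes out to one half.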
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The  Use  of  the  ILS  System 


Linear predictive analysis/synthesis of a speech file is
performed on the ILS system by executing the following
steps.


1. 2ILSFIL - The system asks for the name of the CHOPS file
   and the name of the ILS file to be created (WD#), where
   # is the desired file number.

2. FIL WD# - Designates WD# as the primary file.

3. INA - Initializes the LPC analysis requirements. The
   system asks for the values of these requirements.

4. API N1,N2 - Performs the LPC analysis from frame N1 to
   N2. N1 must be greater than or equal to 3. The API
   command takes speech information from the primary file
   WD# and stores the analysis parameters in a secondary
   file.

5. FIL U# - Unprotects file WD# so that the synthesis
   program can take the synthesized speech and store it
   back into the primary file WD#.

The two remaining steps perform synthesis of speech from
the parameters stored in the secondary file, and run a
program which multiplies the resulting speech file by a
gain factor to prevent clipping when the speech is output
through the D/A converter.
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